This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

plot(cars)

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

**************************************

PART FOUR: CLASS EXERCISES *

**************************************

Jon Sadler - updated Mar 10th 2022

load in Libraries

library(car)
Loading required package: carData
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL)  :
  non-uniform 'Rounding' sampler used
── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.6     ✓ dplyr   1.0.8
✓ tidyr   1.2.0     ✓ stringr 1.4.0
✓ readr   2.1.2     ✓ forcats 0.5.1
Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL)  :
  non-uniform 'Rounding' sampler used
── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
x dplyr::recode() masks car::recode()
x purrr::some()   masks car::some()
library(ggfortify)
library(scales) # You need this library to to access break formatting functions for log axes

Attaching package: ‘scales’

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor

#1. Linear Regression using the faithful dataset # look at the data in the base file

str(faithful)
'data.frame':   272 obs. of  2 variables:
 $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
 $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
faithful
View(faithful)

plot some pictures

scatterplot(eruptions ~ waiting, data = faithful)

Looks decent in terms of the boxplots - indicates normality and homogeneity of variance

Maybe an issue with linearity of the response to explanatory but we’ll have a look

Using ggplot

ggplot(faithful,aes(waiting,eruptions)) + geom_point() + geom_smooth(method=lm)
`geom_smooth()` using formula 'y ~ x'

Specify and examine linear model

summary(lm(eruptions ~ waiting, data = faithful))

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29917 -0.37689  0.03508  0.34909  1.19329 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115,    Adjusted R-squared:  0.8108 
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
anova(lm(eruptions ~ waiting, data = faithful))
Analysis of Variance Table

Response: eruptions
           Df  Sum Sq Mean Sq F value    Pr(>F)    
waiting     1 286.478 286.478  1162.1 < 2.2e-16 ***
Residuals 270  66.562   0.247                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

write the model to an object

faithful.lm <- lm(eruptions ~ waiting, data = faithful)

look at plots for validation

op <- par(mfrow = c(2, 2)) 
plot(faithful.lm)
par(op)

QQ-plot indicates normality

residuals v fitted suggests limited heterogeneity that is a result of two clumps of points, but it the residuals are evenly spread either side of the zero line.

Scale v location indicates homogeneity

Leverage plot indicates a few issues. We’ll leave it and come back to how to deal with this # in subsequent classes

same plots in ggfortify

autoplot(faithful.lm)

Last validation task; plot residuals against explanatory

plot(faithful.lm$resid ~ faithful$waiting,
     xlab = "Waiting time (mins)",
     ylab = "Model residuals")

Generally fine so we’ll go with it for the time being.

Decision will run with a linear model

# We’ll revisit this in a week or so to run another regression technique later in the course # plot model and add R-square etc

plot(eruptions ~ waiting, data = faithful,
     xlab = "Waiting time between eruptions (mins)",
     ylab = "Eruption duration (mins)",
     pch = 20, col = "grey", bg = "grey") # We've left the axes on
# add the regression line from the model (faithful.lm) using abline.....
abline(faithful.lm, col="black")
# add the equation
text(99,2, "eruptions = 0.0756waiting + -1.874", pos = 2, cex=0.65)
# add r-square value
text(99,1.75, expression(paste(R^2 == 0.8108)), pos = 2, cex = 0.6) # add in text using the expression/paste functions
# create a sequence of 1000 number spanning the range of humidities (min to max)
x <- seq(min(faithful$waiting), max(faithful$waiting), l=1000)  # notice this is an 'l' = length. NOT a 1!!!!!!!
#for each value of x, calculate the upper and lower 95% confidence
y<-predict(faithful.lm, data.frame(waiting=x), interval="c")
#plot the upper and lower 95% confidence limits
matlines(x,y, lty=3, col="black") # This function add the CIs, lty = line type (dashed)

plot with ggplot

p <- ggplot(faithful, aes(x=waiting, y=eruptions)) + 
  geom_point(col = "grey") +
  geom_smooth(method=lm, col = "black") + 
  annotate(geom = "text", x = 50, y = 6,
           label = "Eruptions = -0.076Waiting + -1.87\nAdj. R2 = 0.8108",
           hjust = 0)
p + xlab("Waiting time between eruptions (mins)") + ylab("Eruption duration (mins)") + theme_bw()
`geom_smooth()` using formula 'y ~ x'

Predict and plot with ggplot

add ‘fit’, ‘lwr’, and ‘upr’ columns to dataframe (generated by predict)

old.predict <- cbind(faithful, predict(faithful.lm, interval = 'confidence'))

plot the points (actual observations), regression line, and confidence interval

p <- ggplot(old.predict, aes(waiting,eruptions))
p <- p + geom_point()
p <- p + geom_line(aes(waiting, fit))
p <- p + geom_ribbon(aes(ymin=lwr,ymax=upr), alpha=0.3)
p

2. Mussel dataset using abundance as a response variable

load data file

Mussel <- read.csv(file.choose())

Look at it

glimpse(Mussel)
Rows: 25
Columns: 3
$ AREA    <dbl> 516.00, 469.06, 462.25, 938.60, 1357.15, 1773.66, 1686.01, 1786.29, 30…
$ SPECIES <int> 3, 7, 6, 8, 10, 9, 10, 11, 16, 9, 13, 14, 12, 14, 20, 22, 15, 20, 22, …
$ INDIV   <int> 18, 60, 57, 100, 48, 118, 148, 214, 225, 283, 380, 278, 338, 274, 569,…

Plot it to check for linearity

scatterplot(INDIV ~ AREA, data = Mussel)

This indicates that the data are not normaly disributed (especially AREA)

The abundance data don’t look too good either. Huge peak in the middle

might be outliers in both!!!!

Let’s fit a linear model nonetheless

mussel.lm <- lm(SPECIES ~ AREA, data = Mussel)
# look at it
summary(mussel.lm)

Call:
lm(formula = SPECIES ~ AREA, data = Mussel)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1964 -2.7521 -0.7509  1.2094  7.2148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 9.856e+00  1.044e+00   9.441 2.24e-09 ***
AREA        6.593e-04  9.283e-05   7.102 3.10e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.759 on 23 degrees of freedom
Multiple R-squared:  0.6868,    Adjusted R-squared:  0.6732 
F-statistic: 50.44 on 1 and 23 DF,  p-value: 3.102e-07

Now check assumption by using R’s inbuilt model validation plot defaults

set graphics parameters because we want all the plots on one graphic

autoplot(mussel.lm)

Residuals v Fitted indicate a problem. It’s wedge shaped and humped!

qqplot is a bit dodgy but might be okay

Scale-Location plot indicates a wedge

Cook distance / leverage looks okay - but there are some large values in area

FINAL VALIDATION TASK - residuals against explanatory variable

plot(mussel.lm$resid ~ Mussel$AREA,
     xlab = "Mussel bed Area", 
     ylab = "Residuals")

Some code for ggplot to do the same thing…..

ggplot(fortify(mussel.lm, Mussel), aes(AREA, .stdresid)) +
  geom_point() + geom_smooth(se = TRUE)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

This indicates a few large values and a slight wedge due to numerous small patches

Conclusion we reject the model

So what do we do?

We can linearise the explanatory variables by transforming them and re-run the model

I am not a fan of this - we’ll look at other approaches next week!

and repeat the validation

Let’s fit a linear model nonetheless

mussel.lm1 <- lm(SPECIES ~ log10(AREA), data = Mussel)
# look at it
summary(mussel.lm1)

Call:
lm(formula = SPECIES ~ log10(AREA), data = Mussel)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7204 -1.7227  0.3603  1.8136  4.2430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -25.731      3.777  -6.812 6.02e-07 ***
log10(AREA)   11.226      1.030  10.895 1.48e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.707 on 23 degrees of freedom
Multiple R-squared:  0.8377,    Adjusted R-squared:  0.8306 
F-statistic: 118.7 on 1 and 23 DF,  p-value: 1.481e-10

Now check assumption by using R’s inbuilt model validation plot defaults

set graphics parameters because we want all the plots on one graphic

autoplot(mussel.lm1)

Residuals v Fitted are better but there is still a bit of wedge at higher fitted values

qqplot is a better

Scale-Location plot indicates a slight wedge with higher values

Cook distance / leverage looks improved

FINAL VALIDATION TASK - residuals against explanatory variable

ggplot(fortify(mussel.lm1, Mussel), aes(AREA, .stdresid)) +
  geom_point() + geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

This indicates a few large values and a slight wedge due to numerous small patches

Conclusion we reject the model

Try log10 on the response - not unusual with abundance data.

There are no zeros so we don’t need an offset of 0.01 or something similar

Let’s fit a linear model nonetheless

mussel.lm2 <- lm(log10(INDIV) ~ log10(AREA), data = Mussel)
# look at it
summary(mussel.lm2)

Call:
lm(formula = log10(INDIV) ~ log10(AREA), data = Mussel)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43355 -0.06464  0.02219  0.11178  0.26818 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.57601    0.25904  -2.224   0.0363 *  
log10(AREA)  0.83492    0.07066  11.816 3.01e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1856 on 23 degrees of freedom
Multiple R-squared:  0.8586,    Adjusted R-squared:  0.8524 
F-statistic: 139.6 on 1 and 23 DF,  p-value: 3.007e-11

Now check assumption

autoplot(mussel.lm2)

Residuals v Fitted are much better

qqplot is okay - slight issue on lower values

Scale-Location plot is fine

Cook distance / leverage looks okay

FINAL VALIDATION TASK - residuals against explanatory variable

ggplot(fortify(mussel.lm2, Mussel), aes(AREA, .stdresid)) +
  geom_point() 

This is better but still a slight hump

Conclusion we accept the model

plot the final model in ggplot…

p <- ggplot(Mussel, aes(AREA,INDIV)) + 
  geom_point(col = "grey") +
  geom_smooth(method=lm, col = "black") + 
  annotate(geom = "text", x = 1000, y = 30,
           label = "log10(INDIV) = 0.835 + log10(AREA) - 0.576\n[Adj. R2 = 0.8108]", hjust = 0)  # adds text (NOTE \n means start new line)
p + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') + # These commands scale the axes
  annotation_logticks(sides="lb") # This adds log10 tick marks
`geom_smooth()` using formula 'y ~ x'

NOTE: the log axes use the Scales library [see above]

#Plot the graph add equations in base R

plot(log10(INDIV) ~ log10(AREA), data = Mussel,
     xlab = "Log Mussel clump area (mm2)",
     ylab = "Log Number of Individuals",
     pch = 21, col = "grey", bg = "grey") # We've left the axes on
# add the regression line from the model (faithful.lm) using abline.....
abline(mussel.lm2, col="black")
# add the equation
text(4.0, 1.4, expression(paste(Log[10], "INDIV = 0.835", log[10], "AREA - 0.576", pos = 2)), cex = .7)
# add r-square value
text(4.0, 1.3, expression(paste(R^2 == 0.852)), cex = .7) # add in text using the expression/paste functions
# create a sequence of 1000 number spanning the range of humidities (min to max)
x <- seq(min(Mussel$AREA), max(Mussel$AREA), l=1000)  # notice this is an 'l' = length. NOT a 1!!!!!!!
#for each value of x, calculate the upper and lower 95% confidence
y<-predict(mussel.lm2, data.frame(AREA=x), interval="c")
#plot the upper and lower 95% confidence limits
matlines(log10(x),y, lty=3, col="black") # This function add the CIs, lty = line type (dashed) 

add in ggplot code….

p <- ggplot(Mussel, aes(x=log10(AREA), y=log10(INDIV))) + 
  geom_point(col = "blue") +
  geom_smooth(method=lm, col = "black") + 
  annotate(geom = "text", x = 2.5, y = 3,
           label = "Log10INDIV = 0.835 + log10AREA - 0.576\nAdj. R2 = 0.852",
           hjust = 0)
p + xlab("Log Mussel clump area (mm2)") + ylab("Log number of individuals")
`geom_smooth()` using formula 'y ~ x'

3. ANCOVA on compensation dataset

Just so you have a frame of reference going forward;

the compensation data are about the production of fruit (apples, kg)

on rootstocks of different widths (mm; the tops are grafted onto rootstocks).

Furthermore, some trees are in parts of the orchard that allow grazing by cattle,

and others are in parts free from grazing. Grazing may reduce the amount of grass,

which might compete with the apple trees for resources.

Load file

Compensation <- read.csv(file.choose())

look at data

str(Compensation)
'data.frame':   40 obs. of  3 variables:
 $ Root   : num  6.22 6.49 4.92 5.13 5.42 ...
 $ Fruit  : num  59.8 61 14.7 19.3 34.2 ...
 $ Grazing: Factor w/ 2 levels "Grazed","Ungrazed": 2 2 2 2 2 2 2 2 2 2 ...
Compensation$Grazing <- as.factor(Compensation$Grazing) # We need this to be a factor for an ANCOVA

Plot it

# Plot it....
plot(Fruit ~ Root, data = Compensation, pch = 19,cex = 1.5,
     col = c("Black", "Red")[Compensation$Grazing],
     xlab = list("Root biomass (mm)", cex = 1.2),
     ylab = list("Fruit Production (kg)", cex = 1.2))
# Now add a legend
legend(5, 115, legend = c("Grazed", "Ungrazed"),
       col = c("black", "red"), pch = 19)

Or in coplot

coplot(Root~ Fruit | Grazing, data =Compensation) # Grazed plants have higher root biomass. 

You can do this is lattice too

boxplot(Root ~ Grazing, data = Compensation) # These look fine

boxplot(Fruit ~ Grazing, data = Compensation)# These look fine

You might also do some QQ-plots here…..or plot to check homogeneity of variances

but we’ll leave that for validation….

Run ANCOVA

Compensation.lm <- lm(Fruit ~  Grazing*Root, data=Compensation) 

Validate model to check assumptions

autoplot(Compensation.lm)

This looks pretty decent. Slight concern over QQ-plot but it’s not a massive problem.

#Look at model outcomes

anova(Compensation.lm)
Analysis of Variance Table

Response: Fruit
             Df  Sum Sq Mean Sq  F value    Pr(>F)    
Grazing       1  2910.4  2910.4  62.3795 2.262e-09 ***
Root          1 19148.9 19148.9 410.4201 < 2.2e-16 ***
Grazing:Root  1     4.8     4.8   0.1031      0.75    
Residuals    36  1679.6    46.7                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Anova shows that both Root and Fruit are significant terms

But that the interaction isn’t….

summary(Compensation.lm) 

Call:
lm(formula = Fruit ~ Grazing * Root, data = Compensation)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.3177  -2.8320   0.1247   3.8511  17.1313 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -125.173     12.811  -9.771 1.15e-11 ***
GrazingUngrazed        30.806     16.842   1.829   0.0757 .  
Root                   23.240      1.531  15.182  < 2e-16 ***
GrazingUngrazed:Root    0.756      2.354   0.321   0.7500    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.831 on 36 degrees of freedom
Multiple R-squared:  0.9293,    Adjusted R-squared:  0.9234 
F-statistic: 157.6 on 3 and 36 DF,  p-value: < 2.2e-16

Data indicate fruit differs by grazing

No interaction term i.e. slope for factor Grazing are the same

Look at the example in the lecture 6 to figure out the intercept and slope differences, but here we go:

The intercept (first row in the summary output) is the intercept for grazed samples because grazed (i.e G)

comes before ungrazed (i.e. U) in the alphabet and R uses an alphanumeric call on factor data.

The 2nd line GrazingUngrazed is the difference in the intercepts. So Ungrazed is 30.806 larger. Meaning 30.8 more seeds are produced by grazed plants

Calculate the intercept for Ungrazed

-125.173  + 30.806 # = -94.367
[1] -94.367

the third line is slope for root

the fourth line GrazingUngrazed:Root is the difference in slopes between Grazed and Ungrazed = 0.756, which is tiny

the coefficients show this….

coef(Compensation.lm)
         (Intercept)      GrazingUngrazed                 Root GrazingUngrazed:Root 
        -125.1730569           30.8057049           23.2403732            0.7560338 

Now lets add lines to the graphs. We can do it many ways - so here is another one!

Route one - using abline

plot(Fruit ~ Root, data = Compensation, pch = 19,cex = 1.5,
     col = c("Black", "Red")[Compensation$Grazing],
     xlab = list("Root biomass (mm)", cex = 1.2),
     ylab = list("Fruit production (kg)", cex = 1.2))
# Now add a legend
legend(5, 115, legend = c("Grazed", "Ungrazed"),
       col = c("black", "red"), pch = 19)
# add ablines....we'll do it manually using the slopes and intercepts but you can use coeff7() see below.
abline(-125.173, 23.240)
abline(-125.173+30.805, 23.240+0.756, col= "red")

Plot using coefficients

Compensation.lm$coeff[1] # regression coefficient for grazed intercept
(Intercept) 
  -125.1731 
Compensation.lm$coeff[2] # regression coefficient for grazed slope
GrazingUngrazed 
        30.8057 
Compensation.lm$coeff[3] # regression coefficient for diff in grazed/ungrazed intercept
    Root 
23.24037 
Compensation.lm$coeff[4] # regression coefficient for diff in grazed/ungrazed slope
GrazingUngrazed:Root 
           0.7560338 
plot(Fruit ~ Root, data = Compensation, pch = 19,cex = 1.5,
     col = c("Black", "Red")[Compensation$Grazing],
     xlab = list("Root biomass (mm)", cex = 1.2),
     ylab = list("Fruit Production (kg)", cex = 1.2))
# Now add a legend
legend(5, 115, legend = c("Grazed", "Ungrazed"),
       col = c("black", "red"), pch = 19)
# add ablines....using coefficients - coeff()
abline(coef(Compensation.lm)[1], coef(Compensation.lm)[3]) 
# Now add the summer regression line
abline(coef(Compensation.lm)[1] + coef(Compensation.lm)[2], 
       coef(Compensation.lm)[3] + coef(Compensation.lm)[4], 
       col = "red")

do some prediction….and plot….this is yet another way of doing it

#produce base plot

xs <- seq(0, 12, l = 100)
plot(Fruit ~ Root, data = Compensation, xlab = "Root biomass (mm)", ylab = "Fruit Production (kg)")
# Plot the points and predicted trends (using se)
points(Fruit ~ Root, data = Compensation, subset = Grazing == "Grazed", pch = 19)
pred <- predict(Compensation.lm, type = "response", se = TRUE, newdata = data.frame(Root = xs, Grazing = "Grazed", Root = mean(Compensation$Root)))
lines(pred$fit ~ xs)
points(Fruit ~ Root, data = Compensation, subset = Grazing == "Ungrazed", pch = 19, col = "red")
pred <- predict(Compensation.lm, type = "response", se = TRUE, newdata = data.frame(Root = xs, Grazing = "Ungrazed", Root = mean(Compensation$Root)))
lines(pred$fit ~ xs, col = "red")
legend("topleft", legend = c("Grazed", "Ungrazed"), pch = c(19, 19), col = c("red","black"), title = "Fruit Production", bty = "n")
box(bty = "l")

---
title: "Linear Regression Code walkthrough"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*. 

```{r}
plot(cars)
```

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Cmd+Shift+K* to preview the HTML file). 

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

# **************************************
# PART FOUR: CLASS EXERCISES           *
# **************************************
# Jon Sadler - updated Mar 10th 2022

load in Libraries
```{r}
library(car)
library(tidyverse)
library(ggfortify)
library(scales) # You need this library to to access break formatting functions for log axes
```

#1. Linear Regression using the faithful dataset
# look at the data in the base file
```{r}
str(faithful)
faithful
View(faithful)
```

# plot some pictures
```{r}
scatterplot(eruptions ~ waiting, data = faithful)
```

# Looks decent in terms of the boxplots - indicates normality and homogeneity of variance
# Maybe an issue with linearity of the response to explanatory but we'll have a look

# Using ggplot
```{r}
ggplot(faithful,aes(waiting,eruptions)) + geom_point() + geom_smooth(method=lm)
```

# Specify and examine linear model
```{r}
summary(lm(eruptions ~ waiting, data = faithful))
anova(lm(eruptions ~ waiting, data = faithful))
```
# write the model to an object
```{r}
faithful.lm <- lm(eruptions ~ waiting, data = faithful)
```

# look at plots for validation
```{r}
op <- par(mfrow = c(2, 2)) 
plot(faithful.lm)
par(op)
```
# QQ-plot indicates normality
# residuals v fitted suggests limited heterogeneity that is a result of two clumps of points, but it the residuals are evenly spread either side of the zero line.
# Scale v location indicates homogeneity
# Leverage plot indicates a few issues. We'll leave it and come back to how to deal with this # in subsequent classes

# same plots in ggfortify

```{r}
autoplot(faithful.lm)
```

# Last validation task; plot residuals against explanatory
```{r}
plot(faithful.lm$resid ~ faithful$waiting,
     xlab = "Waiting time (mins)",
     ylab = "Model residuals")
```

# Generally fine so we'll go with it for the time being. 

# Decision will run with a linear model
# We'll revisit this in a week or so to run another regression technique later in the course
# plot model and add R-square etc
```{r}
plot(eruptions ~ waiting, data = faithful,
     xlab = "Waiting time between eruptions (mins)",
     ylab = "Eruption duration (mins)",
     pch = 20, col = "grey", bg = "grey") # We've left the axes on
# add the regression line from the model (faithful.lm) using abline.....
abline(faithful.lm, col="black")
# add the equation
text(99,2, "eruptions = 0.0756waiting + -1.874", pos = 2, cex=0.65)
# add r-square value
text(99,1.75, expression(paste(R^2 == 0.8108)), pos = 2, cex = 0.6) # add in text using the expression/paste functions
# create a sequence of 1000 number spanning the range of humidities (min to max)
x <- seq(min(faithful$waiting), max(faithful$waiting), l=1000)  # notice this is an 'l' = length. NOT a 1!!!!!!!
#for each value of x, calculate the upper and lower 95% confidence
y<-predict(faithful.lm, data.frame(waiting=x), interval="c")
#plot the upper and lower 95% confidence limits
matlines(x,y, lty=3, col="black") # This function add the CIs, lty = line type (dashed)

```
#  plot with ggplot
```{r}
p <- ggplot(faithful, aes(x=waiting, y=eruptions)) + 
  geom_point(col = "grey") +
  geom_smooth(method=lm, col = "black") + 
  annotate(geom = "text", x = 50, y = 6,
           label = "Eruptions = -0.076Waiting + -1.87\nAdj. R2 = 0.8108",
           hjust = 0)
p + xlab("Waiting time between eruptions (mins)") + ylab("Eruption duration (mins)") + theme_bw()
```

# Predict and plot with ggplot

# add 'fit', 'lwr', and 'upr' columns to dataframe (generated by predict)
```{r}
old.predict <- cbind(faithful, predict(faithful.lm, interval = 'confidence'))
```
# plot the points (actual observations), regression line, and confidence interval
```{r}
p <- ggplot(old.predict, aes(waiting,eruptions))
p <- p + geom_point()
p <- p + geom_line(aes(waiting, fit))
p <- p + geom_ribbon(aes(ymin=lwr,ymax=upr), alpha=0.3)
p
```
# 2. Mussel dataset using abundance as a response variable
# load data file
```{r}
Mussel <- read.csv(file.choose())
```
# Look at it
```{r}
glimpse(Mussel)
```

Plot it to check for linearity
```{r}
scatterplot(INDIV ~ AREA, data = Mussel)
```
# This indicates that the data are not normaly disributed (especially AREA)
# The abundance data don't look too good either. Huge peak in the middle
# might be outliers in both!!!!

# Let's fit a linear model nonetheless
```{r}
mussel.lm <- lm(SPECIES ~ AREA, data = Mussel)
# look at it
summary(mussel.lm)
```
# Area looks to significantly related to the number of mussels
# Now check assumption by using R's inbuilt model validation plot defaults
# set graphics parameters because we want all the plots on one graphic
```{r}
autoplot(mussel.lm)
```
# Residuals v Fitted indicate a problem. It's wedge shaped and humped!
# qqplot is a bit dodgy but might be okay
# Scale-Location plot indicates a wedge
# Cook distance / leverage looks okay - but there are some large values in area

# FINAL VALIDATION TASK - residuals against explanatory variable
```{r}
plot(mussel.lm$resid ~ Mussel$AREA,
     xlab = "Mussel bed Area", 
     ylab = "Residuals")
```
# Some code for ggplot to do the same thing.....
```{r}
ggplot(fortify(mussel.lm, Mussel), aes(AREA, .stdresid)) +
  geom_point() + geom_smooth(se = TRUE)
```
# This indicates a few large values and a slight wedge due to numerous small patches
# Conclusion we reject the model

# So what do we do?
# We can linearise the explanatory variables by transforming them and re-run the model
# I am not a fan of this - we'll look at other approaches next week!
# and repeat the validation
# Let's fit a linear model nonetheless
```{r}
mussel.lm1 <- lm(SPECIES ~ log10(AREA), data = Mussel)
# look at it
summary(mussel.lm1)
```
# Now check assumption by using R's inbuilt model validation plot defaults
# set graphics parameters because we want all the plots on one graphic
```{r}
autoplot(mussel.lm1)
```
# Residuals v Fitted are better but there is still a bit of wedge at higher fitted values
# qqplot is a better
# Scale-Location plot indicates a slight wedge with higher values
# Cook distance / leverage looks improved

# FINAL VALIDATION TASK - residuals against explanatory variable
```{r}
ggplot(fortify(mussel.lm1, Mussel), aes(AREA, .stdresid)) +
  geom_point() + geom_smooth(se = FALSE)
```
# This indicates a few large values and a slight wedge due to numerous small patches
# Conclusion we reject the model

# Try log10 on the response - not unusual with abundance data.
# There are no zeros so we don't need an offset of 0.01 or something similar
# Let's fit a linear model nonetheless
```{r}
mussel.lm2 <- lm(log10(INDIV) ~ log10(AREA), data = Mussel)
# look at it
summary(mussel.lm2)
```
# Now check assumption 
```{r}
autoplot(mussel.lm2)
```
# Residuals v Fitted are much better 
# qqplot is okay - slight issue on lower values
# Scale-Location plot is fine
# Cook distance / leverage looks okay

# FINAL VALIDATION TASK - residuals against explanatory variable
```{r}
ggplot(fortify(mussel.lm2, Mussel), aes(AREA, .stdresid)) +
  geom_point() 
```
# This is better but still a slight hump
# Conclusion we accept the model

# plot the final model in ggplot...
```{r}
p <- ggplot(Mussel, aes(AREA,INDIV)) + 
  geom_point(col = "grey") +
  geom_smooth(method=lm, col = "black") + 
  annotate(geom = "text", x = 1000, y = 30,
           label = "log10(INDIV) = 0.835 + log10(AREA) - 0.576\n[Adj. R2 = 0.8108]", hjust = 0)  # adds text (NOTE \n means start new line)
p + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') + # These commands scale the axes
  annotation_logticks(sides="lb") # This adds log10 tick marks
```
NOTE: the log axes use the Scales library [see above]

#Plot the graph add equations in base R
```{r}
plot(log10(INDIV) ~ log10(AREA), data = Mussel,
     xlab = "Log Mussel clump area (mm2)",
     ylab = "Log Number of Individuals",
     pch = 21, col = "grey", bg = "grey") # We've left the axes on
# add the regression line from the model (faithful.lm) using abline.....
abline(mussel.lm2, col="black")
# add the equation
text(4.0, 1.4, expression(paste(Log[10], "INDIV = 0.835", log[10], "AREA - 0.576", pos = 2)), cex = .7)
# add r-square value
text(4.0, 1.3, expression(paste(R^2 == 0.852)), cex = .7) # add in text using the expression/paste functions
# create a sequence of 1000 number spanning the range of humidities (min to max)
x <- seq(min(Mussel$AREA), max(Mussel$AREA), l=1000)  # notice this is an 'l' = length. NOT a 1!!!!!!!
#for each value of x, calculate the upper and lower 95% confidence
y<-predict(mussel.lm2, data.frame(AREA=x), interval="c")
#plot the upper and lower 95% confidence limits
matlines(log10(x),y, lty=3, col="black") # This function add the CIs, lty = line type (dashed) 
```
# add in ggplot code....
```{r}
p <- ggplot(Mussel, aes(x=log10(AREA), y=log10(INDIV))) + 
  geom_point(col = "blue") +
  geom_smooth(method=lm, col = "black") + 
  annotate(geom = "text", x = 2.5, y = 3,
           label = "Log10INDIV = 0.835 + log10AREA - 0.576\nAdj. R2 = 0.852",
           hjust = 0)
p + xlab("Log Mussel clump area (mm2)") + ylab("Log number of individuals")
```
# 3. ANCOVA on compensation dataset
# Just so you have a frame of reference going forward;
# the compensation data are about the production of fruit (apples, kg) 
# on rootstocks of different widths (mm; the tops are grafted onto rootstocks). 
# Furthermore, some trees are in parts of the orchard that allow grazing by cattle, 
# and others are in parts free from grazing. Grazing may reduce the amount of grass, 
# which might compete with the apple trees for resources.

# Load file
```{r}
Compensation <- read.csv(file.choose())
```
# look at data
```{r}
str(Compensation)
```

```{r}
Compensation$Grazing <- as.factor(Compensation$Grazing) # We need this to be a factor for an ANCOVA
```
Plot it
```{r}
# Plot it....
plot(Fruit ~ Root, data = Compensation, pch = 19,cex = 1.5,
     col = c("Black", "Red")[Compensation$Grazing],
     xlab = list("Root biomass (mm)", cex = 1.2),
     ylab = list("Fruit Production (kg)", cex = 1.2))
# Now add a legend
legend(5, 115, legend = c("Grazed", "Ungrazed"),
       col = c("black", "red"), pch = 19)
```

Or in coplot
```{r}
coplot(Root~ Fruit | Grazing, data =Compensation) # Grazed plants have higher root biomass. 

```
# You can do this is lattice too
```{r}
boxplot(Root ~ Grazing, data = Compensation) # These look fine
boxplot(Fruit ~ Grazing, data = Compensation)# These look fine
```
# You might also do some QQ-plots here.....or plot to check homogeneity of variances 
# but we'll leave that for validation....

# Run ANCOVA
```{r}
Compensation.lm <- lm(Fruit ~  Grazing*Root, data=Compensation) 
```
# Validate model to check assumptions
```{r}
autoplot(Compensation.lm)
```
# This looks pretty decent. Slight concern over QQ-plot but it's not a massive problem.

#Look at model outcomes
```{r}
anova(Compensation.lm)
```
# Anova shows that both Root and Fruit are significant terms
# But that the interaction isn't....

```{r}
summary(Compensation.lm) 
```
# Data indicate fruit differs by grazing 
# No interaction term i.e. slope for factor Grazing are the same
# Look at the example in the lecture 6 to figure out the intercept and slope differences, but here we go:

# The intercept (first row in the summary output) is the intercept for grazed samples because grazed (i.e G)
# comes before ungrazed (i.e. U) in the alphabet and R uses an alphanumeric call on factor data. 
# The 2nd line GrazingUngrazed is the difference in the intercepts. So Ungrazed is 30.806 larger. Meaning 30.8 more seeds are produced by grazed plants
# Calculate the intercept for Ungrazed
```{r}
-125.173  + 30.806 # = -94.367
```
# the third line is slope for root
# the fourth line GrazingUngrazed:Root is the difference in slopes between Grazed and Ungrazed = 0.756, which is tiny

# the coefficients show this....
```{r}
coef(Compensation.lm)
```
# Now lets add lines to the graphs. We can do it many ways - so here is another one!
# Route one - using abline
```{r}
plot(Fruit ~ Root, data = Compensation, pch = 19,cex = 1.5,
     col = c("Black", "Red")[Compensation$Grazing],
     xlab = list("Root biomass (mm)", cex = 1.2),
     ylab = list("Fruit production (kg)", cex = 1.2))
# Now add a legend
legend(5, 115, legend = c("Grazed", "Ungrazed"),
       col = c("black", "red"), pch = 19)
# add ablines....we'll do it manually using the slopes and intercepts but you can use coeff7() see below.
abline(-125.173, 23.240)
abline(-125.173+30.805, 23.240+0.756, col= "red")
```
# Plot using coefficients
```{r}
Compensation.lm$coeff[1] # regression coefficient for grazed intercept
Compensation.lm$coeff[2] # regression coefficient for grazed slope
Compensation.lm$coeff[3] # regression coefficient for diff in grazed/ungrazed intercept
Compensation.lm$coeff[4] # regression coefficient for diff in grazed/ungrazed slope
plot(Fruit ~ Root, data = Compensation, pch = 19,cex = 1.5,
     col = c("Black", "Red")[Compensation$Grazing],
     xlab = list("Root biomass (mm)", cex = 1.2),
     ylab = list("Fruit Production (kg)", cex = 1.2))
# Now add a legend
legend(5, 115, legend = c("Grazed", "Ungrazed"),
       col = c("black", "red"), pch = 19)
# add ablines....using coefficients - coeff()
abline(coef(Compensation.lm)[1], coef(Compensation.lm)[3]) 
# Now add the summer regression line
abline(coef(Compensation.lm)[1] + coef(Compensation.lm)[2],	
       coef(Compensation.lm)[3] + coef(Compensation.lm)[4],	
       col = "red")
```
# do some prediction....and plot....this is yet another way of doing it
#produce base plot 
```{r}
xs <- seq(0, 12, l = 100)
plot(Fruit ~ Root, data = Compensation, xlab = "Root biomass (mm)", ylab = "Fruit Production (kg)")
# Plot the points and predicted trends (using se)
points(Fruit ~ Root, data = Compensation, subset = Grazing == "Grazed", pch = 19)
pred <- predict(Compensation.lm, type = "response", se = TRUE, newdata = data.frame(Root = xs, Grazing = "Grazed", Root = mean(Compensation$Root)))
lines(pred$fit ~ xs)
points(Fruit ~ Root, data = Compensation, subset = Grazing == "Ungrazed", pch = 19, col = "red")
pred <- predict(Compensation.lm, type = "response", se = TRUE, newdata = data.frame(Root = xs, Grazing = "Ungrazed", Root = mean(Compensation$Root)))
lines(pred$fit ~ xs, col = "red")
legend("topleft", legend = c("Grazed", "Ungrazed"), pch = c(19, 19), col = c("red","black"), title = "Fruit Production", bty = "n")
box(bty = "l")
```
# Plot the points and predicted trends (using ggplot2). A fair bit easier.....NOTE - you'd need to tidy up the labels and the like...
```{r}
ggplot(Compensation, aes(x=Root, y=Fruit, color=factor(Grazing))) +
  geom_point() + 
  geom_smooth(method=lm)
```